Baseball Analysis

Author

Injoong Kim and Stockton Aubrey

PROMPT

Baseball is a great sport, and we are big fans of the game, and with the aid of Lahman’s Database, we are selecting 50 of our favorite baseball players and determining if height and weight affects how many homeruns are hit over the last 10 years of their career and their performance on fielding. Additionally, we seek to ascertain whether a player’s handedness significantly impacts their batting average across the same duration.

The datasets used in this analysis include the People, Batting, and Fielding datasets from Lahman’s Database, encompassing crucial variables such as playerID, nameFirst, nameLast, weight, height, bats, debut, finalGame from the People dataset, playerID, yearID, PO(Putouts), A(Assist), E(Error) from the Fielding dataset,and the playerID, AB(At Bats), H(Hits), and HR(Homerun) variables from the Batting dataset. To ensure a comprehensive and accurate analysis, the 10 seasons under consideration will be the 10 seasons leading up to each player’s final game season, since rookie seasons often lack sufficient data to draw meaningful conclusions.

Through this rigorous statistical investigation, we aim to uncover potential relationships between physical attributes and players’ batting and fielding performance, contributing to the ongoing discourse on the multifaceted dynamics shaping success in professional baseball.

IMPORT

#Read the 'People' Dataset and select the variables we need from the 'People' Dataset 
people <- subset(People, select = c(playerID, nameFirst, nameLast, weight, height, bats, debut, finalGame))
tail(people, n=10)
       playerID nameFirst  nameLast weight height bats debut finalGame
21001 tallice99    Cedric    Tallis     NA     NA <NA>  <NA>      <NA>
21002 bronfch99   Charles  Bronfman     NA     NA <NA>  <NA>      <NA>
21003 somerch99   Charles    Somers     NA     NA <NA>  <NA>      <NA>
21004 feenech99      Chub    Feeney     NA     NA <NA>  <NA>      <NA>
21005  bushge99 George W.      Bush     NA     NA <NA>  <NA>      <NA>
21006 paysojo99      Joan    Payson     NA     NA <NA>  <NA>      <NA>
21007 galbrjo99      John Galbreath     NA     NA <NA>  <NA>      <NA>
21008 mcshejo99      John  McSherry    351     75 <NA>  <NA>      <NA>
21009 weyerle99       Lee     Weyer    258     78 <NA>  <NA>      <NA>
21010 palerst99     Steve   Palermo    175     74 <NA>  <NA>      <NA>
#Read the 'Batting' Dataset and select the variables we need from the 'Batting' Dataset 
batting <- subset(Batting, select = c(playerID, yearID, AB, H, HR))
tail(batting, n=10)
        playerID yearID  AB   H HR
113790 zuverge01   1957  23   3  0
113791 zuverge01   1958   9   2  0
113792 zuverge01   1959   0   0  0
113793 zwilldu01   1910  87  16  0
113794 zwilldu01   1914 592 185 16
113795 zwilldu01   1915 548 157 13
113796 zwilldu01   1916  53   6  1
113797  zychto01   2015   0   0  0
113798  zychto01   2016   0   0  0
113799  zychto01   2017   0   0  0
#Read the 'Fielding' Dataset and select the variables we need from the 'Fielding' Dataset 
fielding <- subset(Fielding, select = c(playerID, yearID, PO, A, E))
tail(fielding, n=10)
        playerID yearID  PO  A  E
151498 zuverge01   1958   4 19  0
151499 zuverge01   1959   0  4  0
151500 zwilldu01   1910  45  2  3
151501 zwilldu01   1914 340 15 14
151502 zwilldu01   1915   3  0  0
151503 zwilldu01   1915 356 20  8
151504 zwilldu01   1916  11  0  0
151505  zychto01   2015   0  3  0
151506  zychto01   2016   0  0  1
151507  zychto01   2017   1  3  0

TIDY

#Select the 50 players who are our favorite players from the 'People' Dataset 
players <- people |>
  filter((nameFirst == "Mike" & nameLast == "Trout") |
           (nameFirst == "Bryce" & nameLast == "Harper") |
           (nameFirst == "Matt" & nameLast == "Kemp") |
           (nameFirst == "Ichiro" & nameLast == "Suzuki") |
           (nameFirst == "Derek" & nameLast == "Jeter") |
           (nameFirst == "Babe" & nameLast == "Ruth") |
           (nameFirst == "Ted" & nameLast == "Williams") |
           (nameFirst == "Barry" & nameLast == "Bonds") |
           (nameFirst == "Manny" & nameLast == "Machado") |
           (nameFirst == "Fernando" & nameLast == "Tatis") |
           (nameFirst == "George" & nameLast == "Brett") |
           (nameFirst == "Albert" & nameLast == "Pujols") |
           (nameFirst == "Ryan" & nameLast == "Howard") |
           (nameFirst == "Buster" & nameLast == "Posey") |
           (nameFirst == "Pete" & nameLast == "Rose" & debut == "4/8/1963") |
           (nameFirst == "David" & nameLast == "Ortiz") |
           (nameFirst == "Mike" & nameLast == "Schmidt") |
           (nameFirst == "Paul" & nameLast == "Goldschmidt") |
           (nameFirst == "Todd" & nameLast == "Helton") |
           (nameFirst == "Matt" & nameLast == "Holliday") |
           (nameFirst == "Alex" & nameLast == "Rodriguez") |
           (nameFirst == "Sammy" & nameLast == "Sosa") |
           (nameFirst == "Mark" & nameLast == "McGwire") |
           (nameFirst == "Mickey" & nameLast == "Mantle") |
           (nameFirst == "Carlos" & nameLast == "Beltran") |
           (nameFirst == "Carl" & nameLast == "Yastrzemski") |
           (nameFirst == "Adrian" & nameLast == "Beltre") |
           (nameFirst == "Hank" & nameLast == "Aaron") |
           (nameFirst == "Willie" & nameLast == "Mays") |
           (nameFirst == "Prince" & nameLast == "Fielder") |
           (nameFirst == "Lou" & nameLast == "Gehrig") |
           (nameFirst == "Manny" & nameLast == "Ramirez") |
           (nameFirst == "Wade" & nameLast == "Boggs") |
           (nameFirst == "Vladimir" & nameLast == "Guerrero" & playerID == "guerrvl01")|
           (nameFirst == "Miguel" & nameLast == "Cabrera") |
           (nameFirst == "Joe" & nameLast == "Mauer")|
           (nameFirst == "Yadier" & nameLast == "Molina")|
           (nameFirst == "Chipper" & nameLast == "Jones")|
           (nameFirst == "Joey" & nameLast == "Votto")|
           (nameFirst == "Freddie" & nameLast == "Freeman")|
           (nameFirst == "Roberto" & nameLast == "Clemente")|
           (nameFirst == "Jose" & nameLast == "Altuve")|
           (nameFirst == "Harmon" & nameLast == "Killebrew")|
           (nameFirst == "Ty" & nameLast == "Cobb")|
           (nameFirst == "Jackie" & nameLast == "Robinson")|
           (nameFirst == "Tony" & nameLast == "Gwynn" & debut == "7/19/1982")|
           (nameFirst == "Cal" & nameLast == "Ripken" & debut == "8/10/1981")|
           (nameFirst == "Shin-Soo" & nameLast == "Choo")|
           (nameFirst == "Adrian" & nameLast == "Gonzalez")|
           (nameFirst == "Carl" & nameLast == "Crawford")|
           (nameFirst == "Mookie" & nameLast == "Betts")|
           (nameFirst == "Aramis" & nameLast == "Ramirez")
  )
players <- with(players,  players[order(nameFirst) , ])
players
    playerID nameFirst    nameLast weight height bats      debut  finalGame
3  beltrad01    Adrian      Beltre    220     71    R 1998-06-24 2018-09-30
18 gonzaad01    Adrian    Gonzalez    215     74    L 2004-04-18 2018-06-10
36 pujolal01    Albert      Pujols    235     75    R 2001-04-02 2022-10-04
40 rodrial01      Alex   Rodriguez    230     75    R 1994-07-08 2016-08-12
37 ramirar01    Aramis     Ramirez    205     73    R 1998-05-26 2015-10-03
41  ruthba01      Babe        Ruth    215     74    L 1914-07-11 1935-05-30
7  bondsba01     Barry       Bonds    185     73    L 1986-05-30 2007-09-26
20 harpebr03     Bryce      Harper    210     75    L 2012-04-28 2022-10-04
35 poseybu01    Buster       Posey    213     73    R 2009-09-11 2021-10-03
13 crawfca02      Carl    Crawford    230     74    L 2002-07-20 2016-06-03
50 yastrca01      Carl Yastrzemski    175     71    L 1961-04-11 1983-10-02
4  beltrca01    Carlos     Beltran    215     73    B 1998-09-14 2017-10-01
25 jonesch06   Chipper       Jones    210     76    B 1993-09-11 2012-10-03
34 ortizda01     David       Ortiz    230     75    L 1997-09-02 2016-10-02
24 jeterde01     Derek       Jeter    195     75    R 1995-05-29 2014-09-28
45 tatisfe01  Fernando       Tatis    185     71    R 1997-07-26 2010-07-04
46 tatisfe02  Fernando       Tatis    217     75    R 2019-03-28 2021-10-03
15 freemfr01   Freddie     Freeman    220     77    L 2010-09-01 2022-10-05
8  brettge01    George       Brett    185     72    L 1973-08-02 1993-10-03
1  aaronha01      Hank       Aaron    180     72    R 1954-04-13 1976-10-03
27 killeha01    Harmon   Killebrew    195     72    R 1954-06-23 1975-09-26
44 suzukic01    Ichiro      Suzuki    175     71    L 2001-04-02 2019-03-21
39 robinja02    Jackie    Robinson    195     71    R 1947-04-15 1956-09-30
30 mauerjo01       Joe       Mauer    225     77    L 2004-04-05 2018-09-30
48 vottojo01      Joey       Votto    220     74    L 2007-09-04 2022-08-16
2  altuvjo01      Jose      Altuve    166     66    R 2011-07-20 2022-10-04
16 gehrilo01       Lou      Gehrig    200     72    L 1923-06-15 1939-04-30
28 machama01     Manny     Machado    218     75    R 2012-08-09 2022-10-05
38 ramirma02     Manny     Ramirez    225     72    R 1993-09-02 2011-04-06
32 mcgwima01      Mark     McGwire    215     77    R 1986-08-22 2001-10-07
22 hollima01      Matt    Holliday    240     76    R 2004-04-16 2018-10-01
26  kempma01      Matt        Kemp    225     76    R 2006-05-28 2020-09-25
29 mantlmi01    Mickey      Mantle    195     71    B 1951-04-17 1968-09-28
9  cabremi01    Miguel     Cabrera    267     76    R 2003-06-20 2022-10-04
42 schmimi01      Mike     Schmidt    195     74    R 1972-09-12 1989-05-28
47 troutmi01      Mike       Trout    235     74    R 2011-07-08 2022-10-05
5  bettsmo01    Mookie       Betts    180     69    R 2014-06-29 2022-10-05
17 goldspa01      Paul Goldschmidt    220     75    R 2011-08-01 2022-10-04
14 fieldpr01    Prince     Fielder    275     71    L 2005-06-13 2016-07-18
11 clemero01   Roberto    Clemente    175     71    R 1955-04-17 1972-10-03
23 howarry01      Ryan      Howard    250     76    L 2004-09-01 2016-10-02
43  sosasa01     Sammy        Sosa    165     72    R 1989-06-16 2007-09-29
10  choosh01  Shin-Soo        Choo    205     71    L 2005-04-21 2020-09-27
49 willite01       Ted    Williams    205     75    L 1939-04-20 1960-09-28
21 heltoto01      Todd      Helton    220     74    L 1997-08-02 2013-09-29
12  cobbty01        Ty        Cobb    175     73    L 1905-08-30 1928-09-11
19 guerrvl01  Vladimir    Guerrero    235     75    R 1996-09-19 2011-09-28
6  boggswa01      Wade       Boggs    190     74    L 1982-04-10 1999-08-27
31  mayswi01    Willie        Mays    170     70    R 1951-05-25 1973-09-09
33 molinya01    Yadier      Molina    225     71    R 2004-06-03 2022-10-05

TRANSFORM

# Join players and batting dataset for and filter the last 10 seasons of the players
battingData <- left_join(players, batting, by = join_by(playerID))|>
  dplyr::select(yearID, playerID, nameFirst, nameLast, weight, height, bats, AB, H, HR)|>
  group_by(playerID)|>
  filter(yearID >= max(yearID)-9)
battingData
# A tibble: 501 × 10
# Groups:   playerID [50]
   yearID playerID  nameFirst nameLast weight height bats     AB     H    HR
    <int> <chr>     <chr>     <chr>     <int>  <int> <fct> <int> <int> <int>
 1   2009 beltrad01 Adrian    Beltre      220     71 R       449   119     8
 2   2010 beltrad01 Adrian    Beltre      220     71 R       589   189    28
 3   2011 beltrad01 Adrian    Beltre      220     71 R       487   144    32
 4   2012 beltrad01 Adrian    Beltre      220     71 R       604   194    36
 5   2013 beltrad01 Adrian    Beltre      220     71 R       631   199    30
 6   2014 beltrad01 Adrian    Beltre      220     71 R       549   178    19
 7   2015 beltrad01 Adrian    Beltre      220     71 R       567   163    18
 8   2016 beltrad01 Adrian    Beltre      220     71 R       583   175    32
 9   2017 beltrad01 Adrian    Beltre      220     71 R       340   106    17
10   2018 beltrad01 Adrian    Beltre      220     71 R       433   118    15
# ℹ 491 more rows
#Get the total homeruns of each player
bodyHR <- battingData|>
  summarize(sumHR = sum(HR))|>
  left_join(people) |>
  dplyr::select(playerID, nameFirst, nameLast, weight, height, sumHR)
Joining with `by = join_by(playerID)`
bodyHR
# A tibble: 50 × 6
   playerID  nameFirst nameLast weight height sumHR
   <chr>     <chr>     <chr>     <int>  <int> <int>
 1 aaronha01 Hank      Aaron       180     72   313
 2 altuvjo01 Jose      Altuve      166     66   195
 3 beltrad01 Adrian    Beltre      220     71   235
 4 beltrca01 Carlos    Beltran     215     73   199
 5 bettsmo01 Mookie    Betts       180     69   252
 6 boggswa01 Wade      Boggs       190     74    54
 7 bondsba01 Barry     Bonds       185     73   388
 8 brettge01 George    Brett       185     72   167
 9 cabremi01 Miguel    Cabrera     267     76   146
10 choosh01  Shin-Soo  Choo        205     71   159
# ℹ 40 more rows
#Get the batting average of each player
handBatAvg <- battingData|>
  summarize(BatAvg =  round(sum(H)/sum(AB), 3))|>
  left_join(people) |>
  dplyr::select(playerID, nameFirst, nameLast, bats, BatAvg)
Joining with `by = join_by(playerID)`
handBatAvg
# A tibble: 50 × 5
   playerID  nameFirst nameLast bats  BatAvg
   <chr>     <chr>     <chr>    <fct>  <dbl>
 1 aaronha01 Hank      Aaron    R      0.286
 2 altuvjo01 Jose      Altuve   R      0.313
 3 beltrad01 Adrian    Beltre   R      0.303
 4 beltrca01 Carlos    Beltran  B      0.278
 5 bettsmo01 Mookie    Betts    R      0.294
 6 boggswa01 Wade      Boggs    L      0.304
 7 bondsba01 Barry     Bonds    L      0.314
 8 brettge01 George    Brett    L      0.293
 9 cabremi01 Miguel    Cabrera  R      0.285
10 choosh01  Shin-Soo  Choo     L      0.266
# ℹ 40 more rows
# Join players and fielding dataset for and filter the last 10 seasons of the players
fieldingData<- left_join(players, fielding, by = join_by(playerID))|>
  dplyr::select(yearID, playerID, nameFirst, nameLast, weight, height, PO, A, E)|>
  group_by(playerID)|>
  filter(yearID >= max(yearID)-9)
fieldingData
# A tibble: 647 × 9
# Groups:   playerID [50]
   yearID playerID  nameFirst nameLast weight height    PO     A     E
    <int> <chr>     <chr>     <chr>     <int>  <int> <int> <int> <int>
 1   2009 beltrad01 Adrian    Beltre      220     71   103   224    14
 2   2010 beltrad01 Adrian    Beltre      220     71   138   285    19
 3   2011 beltrad01 Adrian    Beltre      220     71    93   208    11
 4   2012 beltrad01 Adrian    Beltre      220     71    95   209     8
 5   2013 beltrad01 Adrian    Beltre      220     71    93   232    14
 6   2014 beltrad01 Adrian    Beltre      220     71   144   206    12
 7   2015 beltrad01 Adrian    Beltre      220     71   105   267    17
 8   2016 beltrad01 Adrian    Beltre      220     71   104   301    10
 9   2017 beltrad01 Adrian    Beltre      220     71    50   135     5
10   2018 beltrad01 Adrian    Beltre      220     71    53   145    10
# ℹ 637 more rows
#Get the fielding percentage of each player
fieldingPercent <- fieldingData|>
  summarize(FieldingPercentage =  round((sum(PO)+sum(A))/(sum(PO)+sum(A)+sum(E)), 3))|>
  left_join(people) |>
  dplyr::select(playerID, nameFirst, nameLast, weight, height, FieldingPercentage)
Joining with `by = join_by(playerID)`
fieldingPercent
# A tibble: 50 × 6
   playerID  nameFirst nameLast weight height FieldingPercentage
   <chr>     <chr>     <chr>     <int>  <int>              <dbl>
 1 aaronha01 Hank      Aaron       180     72              0.986
 2 altuvjo01 Jose      Altuve      166     66              0.984
 3 beltrad01 Adrian    Beltre      220     71              0.964
 4 beltrca01 Carlos    Beltran     215     73              0.989
 5 bettsmo01 Mookie    Betts       180     69              0.989
 6 boggswa01 Wade      Boggs       190     74              0.966
 7 bondsba01 Barry     Bonds       185     73              0.982
 8 brettge01 George    Brett       185     72              0.982
 9 cabremi01 Miguel    Cabrera     267     76              0.996
10 choosh01  Shin-Soo  Choo        205     71              0.981
# ℹ 40 more rows

VISUALIZE / MODEL

Impact of Weight and Height on Homeruns

#Between Weight and Homeruns
bodyHRweight <- ggplot(bodyHR, aes(x=sumHR, y=weight))+
  geom_point()+
  labs(title = "Comparison of Homeruns and Players' Weight",
       x ="Number of Home Runs",
       y="Players' Weight")+
  theme_minimal(
  )

bodyHRweight

#Between Height and Homeruns
bodyHRheight <- ggplot(bodyHR, aes(x=sumHR, y=height))+
  geom_point()+
  labs(title = "Comparison of Homeruns and Players' Height",
       x ="Number of Home Runs",
       y="Players' Height")+
  theme_minimal(
  )

bodyHRheight

#Between Weight, Height, and Homeruns
bodyHRboth <- ggplot(bodyHR, aes(x=height, y=weight, color = sumHR, size = sumHR))+
  geom_point(alpha=0.7)+
  labs(title = "Comparison of Homeruns and Players' Height and Weight",
       x ="Players' Height",
       y ="Players' Weight")+
  theme_minimal(
  )
bodyHRboth

#3D version of bodyHRboth
bodyHRboth3D <- plot_ly(x=bodyHR$height,y=bodyHR$weight,z=bodyHR$sumHR) |>
  layout(scene = list(
    xaxis = list(title = "Height"),
    yaxis = list(title = "Weight"),
    zaxis = list(title = "Total Number of Homeruns")))

bodyHRboth3D
No trace type specified:
  Based on info supplied, a 'scatter3d' trace seems appropriate.
  Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Batting Average by Handedness

#Batting Average by Handedness
handedAvg <- ggplot(handBatAvg, aes(y=BatAvg, x= bats, fill=bats))+
  geom_boxplot()+
  labs(title = "Batting Average by Players' Handedness",
       x =NULL,
       y="Players' Career Batting Average")+
  theme_minimal(
  )

handedAvg

meanR <- round(mean(handBatAvg$BatAvg[handBatAvg$bats == 'R'], na.rm = TRUE),3)
meanR
[1] 0.289
meanL <- round(mean(handBatAvg$BatAvg[handBatAvg$bats == 'L'], na.rm = TRUE),3)
meanL
[1] 0.299
meanB <- round(mean(handBatAvg$BatAvg[handBatAvg$bats == 'B'], na.rm = TRUE),3)
meanB
[1] 0.286
#Number of right-handed, left-handed, and both-handed hitters in the dataset
numbats <- table(handBatAvg$bats)
numbats

 B  L  R 
 3 20 27 

Impact of Weight and Height on Fielding Percentage

#Between Weight and Fielding Percentage
bodyFPweight <- ggplot(fieldingPercent, aes(x=FieldingPercentage, y=weight))+
  geom_point()+
  labs(title = "Comparison of Fielding Percentage and Players' Weight",
       x ="Fielding Percentage",
       y="Players' Weight")+
  theme_minimal(
  )

bodyFPweight

#Between Height and Fielding Percentage
bodyFPheight <- ggplot(fieldingPercent, aes(x=FieldingPercentage, y=height))+
  geom_point()+
  labs(title = "Comparison of Fielding Percentage and Players' Height",
       x ="Fielding Percentage",
       y="Players' Height")+
  theme_minimal(
  )

bodyFPheight

#Between Weight, Height, and Homeruns
bodyFPboth <- ggplot(fieldingPercent, aes(x=height, y=weight, color = FieldingPercentage, size = FieldingPercentage))+
  geom_point(alpha=0.7)+
  labs(title = "Comparison of Fielding Percentage and Players' Height and Weight",
       x ="Players' Height",
       y ="Players' Weight")+
  theme_minimal(
  )
bodyFPboth

#3D version of bodyFPboth
bodyFPboth3D <- plot_ly(x=fieldingPercent$height,y=fieldingPercent$weight,z=fieldingPercent$FieldingPercentage) |>
  layout(scene = list(
    xaxis = list(title = "Height"),
    yaxis = list(title = "Weight"),
    zaxis = list(title = "FieldingPercentage")))

bodyFPboth3D
No trace type specified:
  Based on info supplied, a 'scatter3d' trace seems appropriate.
  Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Communicate

Impact of Weight and Height on Homeruns

We have created insightful plots with GGplot and plot_ly to provide a visually compelling representation of our questions. Through these visualizations, we aim to find the relationship between weight, height, and the number of homeruns of the selected 50 baseball players during their careers.

We expected that taller and heavier hitters would record higher numbers of the sum of homeruns during their careers. However, our expectation was not met from the dataset.

It is evident from our analysis that the correlation between weight and height does not invariably dictate one’s ability to achieve a high homerun count; however, it is noteworthy to acknowledge that there may be instances where these factors play a role in influencing such outcomes. When we increase our samples, there could be a relationship between weight, height, and the number of homeruns. Since there are only 50 samples, we might not be able to find a relationship.

Batting Average by Handedness

From the boxplot we can see that the batting average of left-handed hitter is the highest. The mean of the batting average of left-handed is 0.299. And the batting average of right-handed hitter is the second highest. The mean of the batting average of right-handed is 0.290. For the lowest batting average, both-handed is lowest. The mean of the batting average of both-handed is 0.286.

We expected that left-handed’s batting average would be highest and both-handed’s batting average would be lowest. And indeed, our expectations were correct.

There are a lot of left-handed hitters these days, but since there have been a lot more right-handed hitters, left-handed hitters may be able to target pitchers better because pitchers have relatively little experience with left-handed hitters. And left-handed hitters are more likely to hit infield than right-handed hitters because the first base is closer.

A both-handed hitter can play with his right hand when his opponent is a left-handed pitcher, and with his left hand when he is a right-handed pitcher, but that doesn’t mean they are better than highly trained left-handed hitters, or highly trained right-handed hitters. In fact, as long as they don’t use both hands well with real talent, they’re likely to just be an ambiguous both-handed hitter.

Impact of Weight and Height on Fielding Percentage

We expected that shorter and heavier hitters would record lower fielding percentage during their careers. However, our expectation was not met from the dataset.

After creating plots to compare the effect of weight and height on fielding percentage, we can analyze the correlation to some degree. Based on these plots, there is no correlation.